from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly
import plotly.express as px
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
plotly.offline.init_notebook_mode()
iris = load_iris(as_frame=True)
print(iris)
{'data':      sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
..                 ...               ...                ...               ...
149                5.9               3.0                5.1               1.8
[150 rows x 4 columns],
'target': 0      0
1      0
..
149    2
Name: target, Length: 150, dtype: int32,
'frame': ... [150 rows x 5 columns],
'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n...',
'feature_names': ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
'filename': 'iris.csv',
'data_module': 'sklearn.datasets.data'}
type(iris)
sklearn.utils._bunch.Bunch
iris.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
iris['feature_names']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
iris['target_names']
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
iris_df = iris['frame']
iris_df
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 |
150 rows × 5 columns
# Review the top 5 records of the data frame
iris_df.head()
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
iris_df['target'] = np.where(iris_df['target'] == 2, 'virginica', 'non-virginica')
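The `np.where` call above collapses the three iris classes into a binary target: class 2 becomes `'virginica'` and everything else `'non-virginica'`. A minimal sketch of the same relabeling on a toy Series (hypothetical values, not the iris data):

```python
import numpy as np
import pandas as pd

# Toy 3-class target, mirroring the iris encoding (0, 1, 2)
target = pd.Series([0, 1, 2, 2, 0])

# Relabel: class 2 becomes 'virginica', everything else 'non-virginica'
binary = np.where(target == 2, 'virginica', 'non-virginica')
print(list(binary))
```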
iris_df.head()
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | non-virginica |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | non-virginica |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | non-virginica |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | non-virginica |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | non-virginica |
# Review information like column names, data types and total records
iris_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   target             150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
# Check if there are any null values in dataset
df_is_null = iris_df.isnull().sum()
print(df_is_null)
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64
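The iris frame has no missing values, so no cleanup is needed here. If nulls were present, one common option is to drop the affected rows; a minimal sketch on a hypothetical frame (not the iris data):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (hypothetical -- the iris frame has none)
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})
print(df.isnull().sum().sum())  # 1 missing cell

# Drop any row containing a null
clean = df.dropna()
print(len(clean))  # 2 rows remain
```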
# descriptive statistics for each of the two classes.
df_stats = iris_df.groupby('target').describe()
df_stats
| sepal length (cm) | sepal width (cm) | ... | petal length (cm) | petal width (cm) | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
| target | |||||||||||||||||||||
| non-virginica | 100.0 | 5.471 | 0.641698 | 4.3 | 5.000 | 5.4 | 5.9 | 7.0 | 100.0 | 3.099 | ... | 4.325 | 5.1 | 100.0 | 0.786 | 0.565153 | 0.1 | 0.2 | 0.8 | 1.3 | 1.8 |
| virginica | 50.0 | 6.588 | 0.635880 | 4.9 | 6.225 | 6.5 | 6.9 | 7.9 | 50.0 | 2.974 | ... | 5.875 | 6.9 | 50.0 | 2.026 | 0.274650 | 1.4 | 1.8 | 2.0 | 2.3 | 2.5 |
2 rows × 32 columns
plt.figure(figsize=(15, 10))
for i, col in enumerate(iris_df.columns, start=1):
    # Skip the 'target' column
    if col == 'target':
        continue
    # Create a subplot for each feature column
    plt.subplot(2, 2, i)
    # Create the histogram with hue='target'
    sns.histplot(data=iris_df, x=col, hue='target', kde=True)
    # Set the title
    plt.title('Histogram of {}'.format(col))
# Adjust the spacing and show the plot
plt.tight_layout()
plt.show()
corr_matrix = iris_df.drop('target', axis=1).corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr_matrix, annot=True, cmap='Blues', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()
Shows the bivariate relationship between each pair of features
Reference : Python Data Visualizations
sns.pairplot(iris_df, hue="target", height=3, diag_kind="kde")
<seaborn.axisgrid.PairGrid at 0x1f7a0918d90>
Shows the distribution of data and skewness
Reference : Python Data Visualizations
iris_df.boxplot(by="target", figsize=(12, 6))
array([[<Axes: title={'center': 'petal length (cm)'}, xlabel='[target]'>,
<Axes: title={'center': 'petal width (cm)'}, xlabel='[target]'>],
[<Axes: title={'center': 'sepal length (cm)'}, xlabel='[target]'>,
<Axes: title={'center': 'sepal width (cm)'}, xlabel='[target]'>]],
dtype=object)
Shows a 3-dimensional view of three features
Reference : 3D plots
fig = px.scatter_3d(iris_df, x='sepal length (cm)', y='sepal width (cm)', z='petal length (cm)',
color='target')
fig.show()
X = iris_df.drop(['target'], axis=1)
y = iris_df['target'] == 'virginica'
train_ratio = .80
test_ratio = .10
validation_ratio = .10
models = {}
for i in range(1, 5):
    X_i = X.iloc[:, :i]
    # Split the data into train, test and validation sets
    X_train, X_test, y_train, y_test = train_test_split(X_i, y.values.ravel(), test_size=test_ratio, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=validation_ratio / (1 - test_ratio), random_state=42)
    # Create and train the logistic regression model with 'i' features
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)
    models[i] = model
print(models)
{1: LogisticRegression(random_state=42), 2: LogisticRegression(random_state=42), 3: LogisticRegression(random_state=42), 4: LogisticRegression(random_state=42)}
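The two-stage split used above can be checked on toy data: the first `train_test_split` holds out the test set, and the second carves the validation set out of the remainder, using `validation_ratio / (1 - test_ratio)` so the final proportions come out as intended. A minimal sketch with 100 hypothetical samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.zeros(100)

test_ratio = 0.10
validation_ratio = 0.10

# First split: hold out 10% as the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(X_toy, y_toy, test_size=test_ratio, random_state=42)
# Second split: take 0.10 / 0.90 of the remaining 90 samples as validation
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=validation_ratio / (1 - test_ratio), random_state=42)
print(len(X_tr), len(X_va), len(X_te))  # 80 10 10
```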
metrics = []
for i, model in models.items():
    X_val_i = X_val.iloc[:, :i]
    # Use the model to predict the probabilities and the classes
    proba_val = model.predict_proba(X_val_i)[:, 1]  # Probability of 'virginica'
    pred_val = model.predict(X_val_i)  # Predicted class
    print(f'\nModel : {model} with {i} features')
    table = pd.DataFrame({
        'Instance number': X_val_i.index,
        'Probability of virginica': proba_val,
        'Model prediction': pred_val,
        'Ground truth': y_val})
    print(table)
    print(f"Summary of the model with {i} features on the validation set:")
    # Calculate the accuracy of the model on the validation set
    accuracy_val = np.mean(pred_val == y_val)
    print(f"Accuracy : {accuracy_val}")
    # Calculate the log loss
    logloss_val = log_loss(y_val, proba_val)
    print(f"Log loss of the model : {logloss_val}")
    metrics.append({'Model': f'Model Feature={i}',
                    'Log-Loss': logloss_val})
Model : LogisticRegression(random_state=42) with 1 features
Instance number Probability of virginica Model prediction Ground truth
0 115 0.576720 True True
1 35 0.064688 False False
2 91 0.418391 False False
3 42 0.018915 False False
4 40 0.064688 False False
5 53 0.167034 False False
6 143 0.761502 True True
7 88 0.198786 False False
8 87 0.524087 True False
9 149 0.319692 False True
10 60 0.064688 False False
11 67 0.275261 False False
12 86 0.720719 True False
13 139 0.797999 True True
14 1 0.052939 False False
Summary of the model with 1 features on the validation set:
Accuracy : 0.8
Log loss of the model : 0.3832909328304838
Model : LogisticRegression(random_state=42) with 2 features
Instance number Probability of virginica Model prediction Ground truth
0 115 0.527268 True True
1 35 0.054036 False False
2 91 0.413153 False False
3 42 0.015733 False False
4 40 0.041985 False False
5 53 0.267725 False False
6 143 0.722770 True True
7 88 0.195875 False False
8 87 0.666397 True False
9 149 0.315295 False True
10 60 0.141545 False False
11 67 0.326790 False False
12 86 0.697266 True False
13 139 0.778828 True True
14 1 0.052245 False False
Summary of the model with 2 features on the validation set:
Accuracy : 0.8
Log loss of the model : 0.42936650538756177
Model : LogisticRegression(random_state=42) with 3 features
Instance number Probability of virginica Model prediction Ground truth
0 115 0.779572 True True
1 35 0.000005 False False
2 91 0.290399 False False
3 42 0.000009 False False
4 40 0.000006 False False
5 53 0.090373 False False
6 143 0.959330 True True
7 88 0.082049 False False
8 87 0.225165 False False
9 149 0.706750 True True
10 60 0.025216 False False
11 67 0.090214 False False
12 86 0.304164 False False
13 139 0.814903 True True
14 1 0.000012 False False
Summary of the model with 3 features on the validation set:
Accuracy : 1.0
Log loss of the model : 0.14023623382003317
Model : LogisticRegression(random_state=42) with 4 features
Instance number Probability of virginica Model prediction Ground truth
0 115 0.901372 True True
1 35 0.000003 False False
2 91 0.212009 False False
3 42 0.000004 False False
4 40 0.000003 False False
5 53 0.079641 False False
6 143 0.976464 True True
7 88 0.061013 False False
8 87 0.169502 False False
9 149 0.716479 True True
10 60 0.017029 False False
11 67 0.039739 False False
12 86 0.253573 False False
13 139 0.881182 True True
14 1 0.000005 False False
Summary of the model with 4 features on the validation set:
Accuracy : 1.0
Log loss of the model : 0.10051287143887465
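Log loss penalizes confident wrong predictions much more heavily than accuracy does, which is why it separates the 3- and 4-feature models even though both reach perfect accuracy here. A minimal sketch, computing it by hand on hypothetical probabilities to match `sklearn.metrics.log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical binary labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.4])

# Log loss = mean of -[y*log(p) + (1-y)*log(1-p)]
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print(manual, log_loss(y_true, p))  # the two values agree
```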
metrics_summary = "| Model | Log-Loss |\n|-------|----------|\n"
for result in metrics:
    metrics_summary += "| {Model} | {Log-Loss} |\n".format(**result)
print(metrics_summary)
| Model | Log-Loss |
|-------|----------|
| Model Feature=1 | 0.3832909328304838 |
| Model Feature=2 | 0.42936650538756177 |
| Model Feature=3 | 0.14023623382003317 |
| Model Feature=4 | 0.10051287143887465 |
# Create a figure for the 1- and 2-feature decision boundaries
fig = plt.figure(figsize=(18, 6))
for i, model in models.items():
    if i == 1:
        ax = fig.add_subplot(1, 2, 1)
        # For one feature, the boundary is the point where w*x + b = 0
        decision_boundary = -model.intercept_[0] / model.coef_[0][0]
        ax.axvline(decision_boundary, color='black')
        ax.scatter(X_val.iloc[:, 0], [0] * len(X_val), c=y_val, edgecolors='k', linewidth=1, alpha=0.6)
        ax.set_title('Decision Boundary for Model with 1 Feature')
    elif i == 2:
        ax = fig.add_subplot(1, 2, 2)
        # For two features, the boundary is the line w1*x1 + w2*x2 + b = 0
        decision_boundary_x1 = np.linspace(X_val.iloc[:, 0].min(), X_val.iloc[:, 0].max(), 100)
        decision_boundary_x2 = -model.intercept_[0] / model.coef_[0][1] - model.coef_[0][0] / model.coef_[0][1] * decision_boundary_x1
        ax.plot(decision_boundary_x1, decision_boundary_x2, color='black')
        ax.scatter(X_val.iloc[:, 0], X_val.iloc[:, 1], c=y_val, edgecolors='k', linewidth=1, alpha=0.6)
        ax.set_title('Decision Boundary for Model with 2 Features')
    elif i == 3:
        # For three features, the boundary is the plane w1*x1 + w2*x2 + w3*x3 + b = 0
        x1, x2 = np.meshgrid(np.linspace(X_val.iloc[:, 0].min(), X_val.iloc[:, 0].max(), 10),
                             np.linspace(X_val.iloc[:, 1].min(), X_val.iloc[:, 1].max(), 10))
        decision_boundary_x3 = (-model.intercept_[0] / model.coef_[0][2]
                                - model.coef_[0][0] / model.coef_[0][2] * x1
                                - model.coef_[0][1] / model.coef_[0][2] * x2)
        df = pd.DataFrame({
            'X1': x1.flatten(),
            'X2': x2.flatten(),
            'Decision Boundary': decision_boundary_x3.flatten()
        })
        # Create the 3D scatter plot of the boundary plane
        fig3d = px.scatter_3d(df, x='X1', y='X2', z='Decision Boundary')
        fig3d.update_layout(title_text='Decision Boundary for Model with 3 Features')
        fig3d.show()
plt.show()
The models with 1 and 2 features each made 3 incorrect predictions out of 15 validation instances.
The models with 3 and 4 features correctly classified all 15 instances. This suggests that the additional features carry information that helps the model separate the classes.
Based on accuracy and log loss, the model with 4 features is the best choice. Here's why:
It classifies the entire validation set correctly, unlike the models with 1 and 2 features; it uses more information (i.e., more features) to make its predictions, which generally leads to better performance; and it has the lowest log loss of the four models, approximately 0.1005 on the validation set.
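Selecting the best model programmatically amounts to taking the minimum log loss over the collected metrics. A minimal sketch using hypothetical values in the same shape as the notebook's `metrics` list:

```python
# Hypothetical metrics, in the same shape as the notebook's `metrics` list
metrics = [
    {'Model': 'Model Feature=1', 'Log-Loss': 0.3833},
    {'Model': 'Model Feature=2', 'Log-Loss': 0.4294},
    {'Model': 'Model Feature=3', 'Log-Loss': 0.1402},
    {'Model': 'Model Feature=4', 'Log-Loss': 0.1005},
]

# Pick the entry with the smallest log loss
best = min(metrics, key=lambda m: m['Log-Loss'])
print(best['Model'])  # Model Feature=4
```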
To summarize the results of the best model on the test set, we use the same process as for the validation set.
X_test_i = X_test.iloc[:, :4]
proba = models[4].predict_proba(X_test_i)[:, 1]
pred = models[4].predict(X_test_i)
# Calculate the accuracy of the model on the test set
accuracy_test = np.mean(pred == y_test)
print(f"Accuracy of the model with 4 features on the test set: {accuracy_test}")
# Calculate the log loss
logloss_test = log_loss(y_test, proba)
print(f"Log loss of the model with 4 features on the test set: {logloss_test}")
table = pd.DataFrame({
    'Instance number': X_test_i.index,
    'Probability of virginica': proba,
    'Model prediction': pred,
    'Ground truth': y_test})
print("Model with 4 features on test set:")
print(table)
Accuracy of the model with 4 features on the test set: 1.0
Log loss of the model with 4 features on the test set: 0.11796331617414953
Model with 4 features on test set:
Instance number Probability of virginica Model prediction Ground truth
0 73 0.213315 False False
1 18 0.000007 False False
2 118 0.998688 True True
3 78 0.218776 False False
4 76 0.306342 False False
5 31 0.000007 False False
6 64 0.017760 False False
7 141 0.832739 True True
8 68 0.310019 False False
9 82 0.035147 False False
10 110 0.735148 True True
11 12 0.000005 False False
12 36 0.000002 False False
13 9 0.000005 False False
14 19 0.000004 False False
The model that performed best was the model with 4 features. It reached an accuracy of 1.0 on the test set, meaning it correctly classified all 15 test instances, and it achieved a low log loss of approximately 0.118 on the test set.
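A perfect accuracy claim is easy to double-check with a confusion matrix, which breaks predictions down into true/false positives and negatives. A minimal sketch on hypothetical binary predictions (not the notebook's actual test split):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions for 5 instances
y_true = [False, False, False, True, True]
y_pred = [False, False, False, True, True]

# Rows are true classes, columns are predicted classes;
# a perfect classifier has zeros off the diagonal
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 0]
           #  [0 2]]
```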